Setting up containers with systemd-nspawn

There has been a lot of buzz around containers for quite a while now. Often seen as a “VM light” (virtual machine light), they allow users to run services in dedicated environments, isolating them from each other and the host system on several layers.

Some people will probably now think: “Hey, you can’t just compare virtual machines and containers like that! They’re totally different!” And I agree: from a technical point of view, they are quite different. Containers run on the same operating system kernel as the host, for example. However, when looking only at the means of isolation the two provide, the comparison still makes sense.

If you have already read a bit about containers, you have probably stumbled upon Docker, the most popular container software right now, or Kubernetes, which is often used in conjunction with Docker to orchestrate environments that employ a plethora of different containers. And while Docker and Kubernetes are great in their own right, one may not want to install and set up a full Docker runtime just to containerize a process.

What many people don’t know is that systemd actually has its own suite of tools to set up containers. If you’re on an up-to-date Linux distribution, e.g. Debian 10 (buster), you probably already have systemd installed anyway, so there is very little left to install in order to get containers working. In this article, I want to show you how to get a simple container up and running using the tools systemd provides.

How do containers work?

The best way to imagine what containers do is to think in terms of namespaces. There are several namespaces in your system that you can restrict access to. Let’s look at some of them:

Filesystem. One of the most important steps to isolate a service is to restrict filesystem access. Containers provide chroot-like functionality, allowing you to specify an arbitrary filesystem root. This means that you can decide which programs you want installed in your container fully independent from the host system. However, it also means that you need to set up a second OS tree for the container containing all the tools an operating system needs to run. But this is easy, as we will see later. In the end, you will have a typical Unix filesystem hierarchy inside your specified root directory, containing /bin, /etc, /home, /usr and so on.

Process IDs (PIDs). When you run ps -ef, you see every process running on the system, together with their unique PIDs. However, we don’t want something running in the container to see processes from the host or other containers. In addition, we want a process in the container to be able to have e.g. PID 1 (init), even though we already have a PID 1 on the host.

User IDs (UIDs). Every user on a system has a user ID. The root user has ID 0, while normal user IDs typically start at 1000. User namespacing is mostly a precaution: should someone manage to break out of your container into the host system, they will not have the UID of an existing user on the host. Without user namespacing, the root user in a container also appears to the host system as UID 0. Enabling the feature adds an arbitrary high number to all UIDs inside a container, so the container’s root user (appearing as UID 0 in the container) will actually have, for example, UID 64789232 on the host system.

Network interfaces. Running ip addr on the host shows you all available network interfaces on your machine. Without network namespacing, running the same command in a container gives you the exact same result, meaning the container has the same network access as the host. To prevent this, we can stop the container from accessing the host’s network interfaces and add a bridge so that the container can still reach the internet and the local network.

All the namespaces are managed by the Linux kernel. It is possible to enable or disable support for these namespaces at kernel compile time. If some of these are not working for you, perhaps they have been turned off or your kernel is too old.
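To make the idea of namespaces a bit more tangible: the kernel exposes one symbolic link per namespace under /proc/<pid>/ns, and two processes share a namespace exactly when the corresponding links show the same identifier. The following is a minimal sketch that prints these links; the function name list_namespaces and its optional second argument (an alternative proc root, handy for testing) are my own invention and not part of any tool.

```shell
# List the namespaces of a process as "name -> identifier" lines.
# $1: PID (default: "self")
# $2: proc root (default: /proc; hypothetical hook for testing)
list_namespaces() {
    ns_dir="${2:-/proc}/${1:-self}/ns"
    for link in "$ns_dir"/*; do
        # Skip the unexpanded pattern if the directory is missing or empty.
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink "$link")"
    done
}

# Run this once on the host and once inside a container, then compare
# the identifiers to see which namespaces differ.
list_namespaces
```

If an expected entry (for example user) is missing from the output, support for that namespace is probably not compiled into your kernel.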

A note on single process containers

Some people in the Docker community advocate the “one process per container” model, meaning that each container you set up is dedicated to running exactly one process inside of it. However, as this article also suggests, it is a good idea to have at least a minimal init system in your container. The problem is that processes may fork() child processes, and when a parent exits without properly wait()ing for its children, the process with PID 1 (the init process) becomes responsible for cleanup duty. If PID 1 is not a proper init process, nobody reaps these orphaned children once they exit, and they linger as zombies. Some processes rely on init as a zombie reaper, e.g. daemons when double forking. In this guide, we will have systemd as the init system inside our container.
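If you want to check whether zombies are piling up somewhere, you can look at the process states: ps marks a zombie with a state code starting with Z. The sketch below filters the output of ps accordingly. The function name list_zombies is my own, and I assume the procps version of ps that Debian ships; other implementations may not understand the -o format used here.

```shell
# Print the PID and command name of every zombie process.
# Reads lines of the form "STATE PID COMMAND" on stdin, as produced by
# `ps -eo stat=,pid=,comm=`; a state starting with "Z" marks a zombie.
# Only the first word of the command name is printed.
list_zombies() {
    awk '$1 ~ /^Z/ { print $2, $3 }'
}

ps -eo stat=,pid=,comm= | list_zombies
```

Empty output means there are no zombies at the moment, which is what you should see in a container with a working init system.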

Creating a container step by step

I will show how to create containers on Debian, but the steps should be similar for other Linux distributions. Note that you need root privileges to run the commands in the following sections.

Installing required packages

We need some packages to install and run the container. Let’s install them first:

# apt install debootstrap systemd-container bridge-utils

debootstrap installs a minimal Debian system into a custom directory. systemd-container contains the systemd tools to run and configure containers. bridge-utils allows for easy setup of a bridge to give the container network access.
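To double-check that the installation went through, you can test whether the commands used in this guide are available. This is just a small sketch; missing_tools is a helper name I made up. Note that some of these commands live in /usr/sbin, so run it as root (or with /usr/sbin in your PATH) to avoid false alarms.

```shell
# Print the name of every command from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The commands this guide relies on; empty output means all are installed.
missing_tools debootstrap systemd-nspawn machinectl brctl
```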

Set up the OS tree

First, we need to set up an OS tree in an empty directory, which will serve as the container’s root directory. To achieve this, we will use debootstrap.

systemd expects containers to be located inside the /var/lib/machines directory. They can be elsewhere, but then some tools won’t automatically recognize the container. Let’s set up a container named helloworld:

# mkdir -p /var/lib/machines/helloworld

# debootstrap stable /var/lib/machines/helloworld http://deb.debian.org/debian/

This will install everything required to run Debian into our new container, which may take some time.

You should also make sure that only root can access the machines directory:

# chown root:root /var/lib/machines

# chmod 700 /var/lib/machines
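Before moving on, it doesn’t hurt to verify that debootstrap actually produced something that looks like an operating system. The following sketch is a rough sanity check of my own: the function name looks_like_os_tree and the criteria (an os-release file plus a few standard directories) are assumptions on my part, not an official test.

```shell
# Succeed if the directory looks like a bootstrapped OS tree.
looks_like_os_tree() {
    root="$1"
    # An os-release file identifies the installed distribution.
    [ -e "$root/etc/os-release" ] || [ -e "$root/usr/lib/os-release" ] || return 1
    # A few directories of the standard filesystem hierarchy must exist.
    for dir in bin etc usr var; do
        [ -d "$root/$dir" ] || return 1
    done
    return 0
}

if looks_like_os_tree /var/lib/machines/helloworld; then
    echo "OS tree found"
else
    echo "no OS tree found (or not readable; try again as root)"
fi
```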

An overview of systemd’s container commands

Now that we have the operating system set up, it is a good time to introduce the commands that you will use to control your container. I will explain them in more detail as they are used later in this guide.

systemd-nspawn is used to directly start a container. By default, it just spawns a root shell inside the container’s environment. Using the -b switch will instead boot the container and afterwards give you a login shell. However, as soon as the command terminates, the container shuts down as well. Also, if you want to use user namespacing, I strongly recommend always launching the container with the -U switch, which enables it. The first time you do this, systemd will adjust the ownership of all files inside the container, and disabling user namespacing later can lead to some very weird errors (and files being owned by nobody:nogroup). Finally, with the -M machine_name switch, you specify the container to launch.

machinectl is very similar to systemd’s systemctl, but is used specifically to work with containers. In fact, containers can be started, stopped and enabled just like any other systemd service. The service template file is located at /lib/systemd/system/systemd-nspawn@.service. It is worth taking a look inside, especially at the specified executable:

[Service]
ExecStart=/usr/bin/systemd-nspawn --quiet --keep-unit --boot --link-journal=try-guest --network-veth -U --settings=override --machine=%i

As you can see, there are already quite a few command line arguments specified by default when you use machinectl to start a container. One of them is --boot (the same as -b), so the container will boot and not just spawn a root shell. --network-veth creates a virtual ethernet interface (without bridging) in the container, but more on that later. Note that -U is specified, so user namespacing will be enabled.

/etc/systemd/nspawn is a directory which can contain configuration files for container-specific settings. All the settings can also be specified via the systemd-nspawn command line, but this saves you from typing them out every time. In addition, the configuration files are also honored when using machinectl to start containers. If the directory doesn’t exist, just create it using

# mkdir /etc/systemd/nspawn
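To make the effect of the -U switch described above more concrete: the kernel publishes the UID mapping of a process in /proc/<pid>/uid_map, where each line has three columns (first UID inside the namespace, first UID outside of it, and the length of the range). The sketch below translates a container UID to the host UID using such a line. The function name host_uid and the base UID in the example are made up for illustration; your container will get a different range.

```shell
# Translate a UID inside a user namespace to the UID seen on the host.
# $1: UID inside the container
# stdin: uid_map content, lines of "<inside-start> <outside-start> <length>"
host_uid() {
    while read -r inside outside length; do
        if [ "$1" -ge "$inside" ] && [ "$1" -lt $((inside + length)) ]; then
            echo $((outside + $1 - inside))
            return 0
        fi
    done
    # The UID is not covered by any line of the mapping.
    return 1
}

# Hypothetical mapping: 65536 container UIDs starting at host UID 1000013824.
echo "0 1000013824 65536" | host_uid 0      # the container's root user
echo "0 1000013824 65536" | host_uid 1000   # a regular user in the container
```

For a running container, you could feed the function the uid_map file of any process that belongs to the container.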

Configuring the container

Now we have an OS tree set up and you have basic knowledge of systemd’s tools. So let’s fire up the container.

The first thing we do is set a root password. To do so, we’ll obtain a root shell without booting the container:

# systemd-nspawn -UM helloworld

Your command prompt should now look like this (or at least similar):

root@helloworld:~# _

Now set the password using

# passwd

Type in your root password and exit the container again:

# exit

Now let’s see if we can boot the whole thing:

# systemd-nspawn -UbM helloworld

You should see the usual system startup messages. If everything works, you should see a login prompt after a few seconds. Log in as root with the password you just set.

Next, I recommend that you install dbus. It is required if you want to log into running containers using machinectl on the host.

# apt install dbus

Right now, the container should see the same network interfaces as the host. You can use ip addr to verify this. We’ll set up virtual ethernet in the next section. However, we can already tell the container to automatically bring up the virtual ethernet interface:

# echo 'auto host0' >> /etc/network/interfaces

# echo 'iface host0 inet dhcp' >> /etc/network/interfaces

You should also change the hostname of the container, or it will have the same one as the host. I will name it helloworld, but you can use whatever you want:

# echo 'helloworld' > /etc/hostname

Okay, let’s exit the container for now:

# poweroff

Configuring the host

Now let’s try to get networking up and running for the container. To achieve this, we must first configure the host system appropriately.

In this guide, we will set up a network bridge on the host and connect the container’s virtual ethernet interface to the bridge. This means that the host and the container will be visible to the network as two different machines. If you connect your computer to a router, it will assign a separate IP address to the container.

Note that this method will probably not work for you if you are on a server with a static IP address. In this case, you will need to set up NAT on your host or some kind of proxy, which will not be covered here.

First, let’s tell systemd that our container will use a network bridge from now on. This is supported directly by systemd; it will take care of connecting the container’s virtual ethernet to the host’s bridge interface when the container is started. To configure this, open /etc/systemd/nspawn/helloworld.nspawn with your favorite text editor and add the following lines:

[Network]
VirtualEthernet=yes
Bridge=br0

This will instruct systemd to connect the container’s host0 virtual ethernet to the host’s network interface br0.

Now we have a connection between container and host, but we still need to create and configure the bridge br0 so the container can actually access the internet. Ensure that IPv4 packet forwarding is enabled on the host: open /etc/sysctl.conf with your favorite text editor and look for net.ipv4.ip_forward. Make sure it is set to 1 and not commented out:

net.ipv4.ip_forward=1
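Editing /etc/sysctl.conf alone does not change the running kernel, so it is worth checking the live value, which the kernel exposes in /proc/sys/net/ipv4/ip_forward. Here is a small sketch for that; the function name forwarding_enabled and its optional file argument are my own additions.

```shell
# Succeed if IPv4 forwarding is enabled.
# $1: file holding the flag (default: the live kernel setting in /proc).
forwarding_enabled() {
    [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}" 2>/dev/null)" = "1" ]
}

if forwarding_enabled; then
    echo "IPv4 forwarding is on"
else
    echo "IPv4 forwarding is off; run 'sysctl -p' as root after editing /etc/sysctl.conf"
fi
```

Running sysctl -p reloads /etc/sysctl.conf, so you do not have to reboot for the change to take effect.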

For the remainder of this guide, I will assume that your primary network interface on the host is called eth0. If this is not the case, replace eth0 with the correct name accordingly.
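If you are unsure what your primary interface is called, the default route usually tells you. The following sketch extracts the interface name from the output of ip route show default. The helper name default_iface is made up, and the addresses in the comment are only an example of what the output may look like.

```shell
# Print the interface name used by the default route.
# stdin: output of `ip route show default`, for example
#   default via 10.0.2.2 dev enp0s3 proto dhcp metric 100
default_iface() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++) if ($i == "dev") { print $(i + 1); exit }
    }'
}

ip route show default 2>/dev/null | default_iface
```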

Now open /etc/network/interfaces and add the following lines:

auto br0
iface br0 inet dhcp
    bridge_ports eth0

There should already be an interface definition for eth0 in this file. Ensure that it is set to manual by changing its definition to:

iface eth0 inet manual

Note that if you have network-manager or something similar installed, it may interfere with our network setup. You will probably need to disable it or configure network-manager to not touch the interfaces we are using here.

Alright, now bring up our bridge:

# ifdown eth0

# ifup br0

Use ip addr to see if everything worked as intended. The output should look similar to this; note that eth0 lists br0 as its master and that br0 holds an IP address:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br0 state UP group default qlen 1000
    link/ether 08:00:27:3e:cf:c0 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global enp0s3
       valid_lft forever preferred_lft forever
3: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 08:00:27:3e:cf:c0 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global br0
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe3e:cfc0/64 scope link
       valid_lft forever preferred_lft forever
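Another way to check which interfaces are attached to the bridge is to look into sysfs, where the kernel lists the ports of a bridge under /sys/class/net/<bridge>/brif. The sketch below prints them; bridge_ports is a helper name of my own, and the second argument (an alternative sysfs root) only exists to make the function easy to test.

```shell
# List the interfaces attached to a bridge, one per line.
# $1: bridge name
# $2: sysfs root (default: /sys; hypothetical hook for testing)
bridge_ports() {
    for port in "${2:-/sys}/class/net/$1/brif"/*; do
        # Skip the unexpanded pattern if the bridge has no ports or is missing.
        [ -e "$port" ] || continue
        printf '%s\n' "${port##*/}"
    done
}

bridge_ports br0
```

At this point, the output should contain eth0. Once the container is running with the bridge configuration from above, I would expect its host-side virtual ethernet interface to show up here as well.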

Alright, we should now have internet access on both the host and the container. Let’s verify this:

$ ping google.com -c 4

# systemd-nspawn -UbM helloworld

$ ping google.com -c 4

Both pings should work. Let’s exit the container again:

# poweroff

Running containers in the background

We can now run containers and have given them internet access via a network bridge. This is a great start, but we probably want more. If you plan to use the container e.g. to run an isolated web server, you most likely want the container to run in the background at all times, and also to boot automatically every time the host machine is (re-)booted.

As I mentioned earlier in this article, systemd containers work just like other systemd services, which means we can start, stop, enable, or disable them.
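Since the containers are instances of the systemd-nspawn@.service template shown earlier, each container has a regular unit name that the usual systemd tools understand. The following sketch builds that name; nspawn_unit is a helper I made up, and it assumes a plain machine name that needs no special escaping.

```shell
# Build the systemd unit name behind a container started via machinectl.
# Assumes a plain machine name that needs no systemd unit-name escaping.
nspawn_unit() {
    printf 'systemd-nspawn@%s.service\n' "$1"
}

nspawn_unit helloworld
# The result can be passed to the usual tools, for example:
#   systemctl status "$(nspawn_unit helloworld)"
#   journalctl -u "$(nspawn_unit helloworld)"
```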

To start a container in the background:

# machinectl start helloworld

To get status information for a running container:

# machinectl status helloworld

To obtain a login shell on a running container (requires dbus on both host and container):

# machinectl login helloworld

You can leave the session (as systemd kindly informs you) by pressing Ctrl + ] three times within one second.

To shut down a running container:

# machinectl stop helloworld

To configure a container to boot every time the system boots:

# machinectl enable helloworld

… and to disable autostart again:

# machinectl disable helloworld

That’s all for now. Have fun with your awesome new containers!

Troubleshooting

Q: I get an error “failed to become subreaper” when starting the container. What can I do?

A: Your Linux kernel is probably too old; you need version ≥ 3.4 to run containers. See this stackoverflow post for more information on subreapers.
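To find out whether your kernel is affected, compare the output of uname -r against the required version. Here is a minimal sketch; the function name kernel_at_least is my own, and it only looks at the major and minor version numbers.

```shell
# Succeed if a kernel release string is at least MAJOR.MINOR.
# $1: release string such as "4.19.0-6-amd64"; $2: major; $3: minor.
kernel_at_least() {
    major="${1%%.*}"          # text before the first dot
    rest="${1#*.}"            # text after the first dot
    minor="${rest%%[!0-9]*}"  # leading digits of the remainder
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "${minor:-0}" -ge "$3" ]; }
}

if kernel_at_least "$(uname -r)" 3 4; then
    echo "kernel is new enough for subreapers"
else
    echo "kernel predates 3.4"
fi
```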

Useful manpages